One-way analysis of variance: The F-test

DATAX121-23A (HAM) & (SEC) - Introduction to Statistical Methods

Learning Outcomes

  • The purpose of an analysis of variance (ANOVA)
  • How to conduct and interpret a hypothesis test for the difference between three or more population means
  • How to use R to construct confidence intervals for all possible pairwise differences between two population means
  • How to interpret a confidence interval for the pairwise difference between two population means
  • Checking the assumptions for inference on three or more population means with diagnostic plots

\(\mu_1, \mu_2, \ldots, \mu_k\) from “k” independent samples

CS 2.2 revisited: Fish respiration rates

A professor carried out an experiment to determine the best calcium level to ensure that fish have low respiration rates. The fish were randomly assigned to three tanks with different levels of calcium.

Variables
Calcium A factor denoting the calcium level of the tank, Low, Medium or High
GillRate A number denoting the respiration rate of the fish (gill beats per minute, gbpm)
respiration.df <- read.csv("datasets/fish-respiration.csv")
nrow(respiration.df)
[1] 90
split(respiration.df, ~ Calcium) |> 
  lapply(\(x) summary(x))
$High
   Calcium             GillRate    
 Length:30          Min.   :37.00  
 Class :character   1st Qu.:45.75  
 Mode  :character   Median :58.50  
                    Mean   :58.17  
                    3rd Qu.:68.00  
                    Max.   :85.00  

$Low
   Calcium             GillRate    
 Length:30          Min.   :44.00  
 Class :character   1st Qu.:55.50  
 Mode  :character   Median :65.00  
                    Mean   :68.50  
                    3rd Qu.:84.75  
                    Max.   :98.00  

$Medium
   Calcium             GillRate    
 Length:30          Min.   :33.00  
 Class :character   1st Qu.:46.00  
 Mode  :character   Median :59.50  
                    Mean   :58.67  
                    3rd Qu.:68.75  
                    Max.   :83.00  

How could we analyse CS 2.2?

Individual t-tests for \(\mu\)

Multiple two-sample t-tests for \(\mu_1 - \mu_2\)

# bwplot() comes from the lattice package
library(lattice)
bwplot(GillRate ~ Calcium, data = respiration.df, pch = "|",
       main = "Respiration rate by the tank's calcium level",
       xlab = "Calcium level", ylab = "Respiration rate (gbpm)")

Figure: The respiration rates of fish by their tank’s calcium level

Sample variance (of a numeric variable)

Let \(s\) be the sample standard deviation, as defined in T011

The sample variance of a numeric variable is simply the square of the numeric variable’s sample standard deviation, \(s^2\)

# The R function to calculate the sample variance
var(respiration.df$GillRate)
[1] 237.0961
# Check that the sample variance is indeed the square of s
sd(respiration.df$GillRate)
[1] 15.39793
sd(respiration.df$GillRate)^2
[1] 237.0961

“Variability” (of a numeric variable)

Let \(s\) be the sample standard deviation, as defined in T011

The “variability” of a numeric variable, that is, its sum of squares, is

\[ \begin{aligned} SSTotal &= (n-1) \times s^2 \\ &= \cdots \\ &= \sum_{i=1}^{n} (x_i - \bar{x})^2 \end{aligned} \]

# Calculating the "variability" as per the definition
(nrow(respiration.df) - 1) * var(respiration.df$GillRate)
[1] 21101.56
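
The omitted middle steps of the derivation can be checked numerically. The following is a minimal sketch, assuming a small made-up vector `x` (hypothetical values, not the fish data): it computes the sum of squared deviations directly and compares it with \((n-1) \times s^2\).

```r
# Hypothetical respiration rates (gbpm); NOT the CS 2.2 data
x <- c(44, 52, 58, 65, 71, 85)

# "Variability" from the definition: sum of squared deviations from the mean
ss_direct <- sum((x - mean(x))^2)

# "Variability" from the shortcut: (n - 1) times the sample variance
ss_shortcut <- (length(x) - 1) * var(x)

# The two calculations agree (up to floating-point error)
all.equal(ss_direct, ss_shortcut)
```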

How does “variability” detect a difference?

Decomposing “variability”

The basic idea of ANOVA is to split the “variability” of the numeric variable into two (or more) distinct pieces

One-way ANOVA splits the “variability” of the numeric variable into two distinct pieces: the “variability” between groups and the “variability” within groups

Decomposing “variability”—Sums of squares

If we believe a means-only model is appropriate for the data

\[ SSTotal = SSG + SSR \]

where

  • \(SSTotal\) is the “variability” of the numeric variable
  • \(SSG\) is the “variability” between groups. That is, the “variability” between the sample means
  • \(SSR\) is the “variability” within groups. That is, the “variability” between observations after centring all groups
    Note: Some resources use \(SSE\) notation instead
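
To see the decomposition at work, the three sums of squares can be computed by hand. This is only a sketch under stated assumptions: `toy.df`, `Group` and `y` are hypothetical names, and the twelve values are made up for illustration rather than taken from CS 2.2.

```r
# Hypothetical data: three groups of four observations; NOT the CS 2.2 data
toy.df <- data.frame(
  Group = rep(c("A", "B", "C"), each = 4),
  y     = c(10, 12, 11, 13,  15, 17, 16, 18,  11, 9, 10, 12)
)

grand_mean  <- mean(toy.df$y)
# ave() returns, for each observation, the mean of its own group
group_means <- ave(toy.df$y, toy.df$Group)

ss_total <- sum((toy.df$y - grand_mean)^2)     # total "variability"
ss_g     <- sum((group_means - grand_mean)^2)  # "variability" between groups
ss_r     <- sum((toy.df$y - group_means)^2)    # "variability" within groups

c(SSTotal = ss_total, SSG = ss_g, SSR = ss_r)

# The two pieces add up to the total (up to floating-point error)
all.equal(ss_total, ss_g + ss_r)
```

Note that `ss_r` is computed after subtracting each group's own mean, which is what “centring all groups” refers to.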

Decomposing “variability”—Mean square

However, \(SSG\) and \(SSR\) are not directly comparable. Why?

Let the mean square for groups,
\(MSG\), be defined as

\[ MSG = \frac{SSG}{k- 1} \]

Let the mean square for residuals,
\(MSR\), be defined as

\[ MSR = \frac{SSR}{n-k} \]

where

  • \(SSG\) is the “variability” between groups and \(SSR\) is the “variability” within groups
  • \(n\) is the total number of observations
  • \(k\) is the total number of groups
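
As a rough illustration of why the scaling matters, the sketch below plugs in the sums of squares reported for CS 2.2 in the ANOVA table of this section (2037.2 and 19064.3, with \(n = 90\) and \(k = 3\)). The object names are hypothetical; only the four input numbers come from the source.

```r
# Values taken from the CS 2.2 ANOVA table in this section
ss_g <- 2037.2    # "variability" between groups
ss_r <- 19064.3   # "variability" within groups
n    <- 90        # total number of observations
k    <- 3         # total number of groups

# SSG is built from k - 1 = 2 degrees of freedom, SSR from n - k = 87,
# so the raw sums of squares are on different scales
msg <- ss_g / (k - 1)
msr <- ss_r / (n - k)

# Compare with the "Mean Sq" column of the ANOVA table
c(MSG = msg, MSR = msr)
```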

Assumptions for inference on three or more \(\mu_i\)s

  1. Three or more independent groups
  2. Within group: Independent observations
  3. All groups have a similar measure of spread
  4. Within group: It is approximately Normally distributed

More on 4.

This means not only that the data within each group have to be approximately symmetrical about the group’s sample mean, \(\bar{x}_i\), with no outliers, but also that we expect the shape of the distribution to be bell-like

Like inference for a single mean or two means, we can be lenient on 4. as the number of observations increases…?!?

CS 2.2 revisited: Fish respiration rates

Are the assumptions met? (Using only a plot)

Recall that the fish were randomly assigned to three tanks with different levels of calcium.

histogram( ~ GillRate | Calcium, data = respiration.df, nint = 9,
       main = "Respiration rate by the tank's calcium level",
       xlab = "Respiration rate (gbpm)")

Figure: The respiration rates of fish by their tank’s calcium level

CS 2.2 revisited: Fish respiration rates

# Fit the means-only model to the data
lm(GillRate ~ Calcium, data = respiration.df) |>
  # Decompose the total "variability" between and within groups
  anova()
Analysis of Variance Table

Response: GillRate
          Df  Sum Sq Mean Sq F value  Pr(>F)  
Calcium    2  2037.2 1018.61  4.6484 0.01208 *
Residuals 87 19064.3  219.13                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Recall that the total "variability" is
(nrow(respiration.df) - 1) * var(respiration.df$GillRate)
[1] 21101.56

The F-test

Counter-intuitively, the F-test is a one-sided hypothesis test of whether the observed data provide evidence against all population/underlying/true group means being the same

The hypothesis statements for the F-test

\[ \begin{aligned} H_0\!: & ~ \mu_1 = \mu_2 = \cdots = \mu_k \\ H_1\!: & ~ \text{At least one} ~ \mu_i \neq \mu_j \end{aligned} \]

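The null hypothesis statement means that all \(k\) underlying group means are the same

The alternative hypothesis statement means that at least two of the underlying group means differ from each other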

The test statistic

\[ f_0 = \frac{\frac{SSG}{k-1}}{\frac{SSR}{n-k}} = \frac{MSG}{MSR} \]

where:

  • \(f_0\) is the F-test statistic
  • \(n\) is the total number of observations
  • \(k\) is the total number of groups
  • \(SSG\) is the “variability” between groups and \(SSR\) is the “variability” within groups
  • \(MSG\) is the mean square for groups and \(MSR\) is the mean square for residuals
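
The formula can be checked against what `anova()` reports. The sketch below is an illustration only: it assumes a made-up data frame `toy.df` (hypothetical names and values, not the CS 2.2 data), rebuilds \(f_0\) from the sums of squares, and compares it with the table's `F value`.

```r
# Hypothetical data: three groups of four observations; NOT the CS 2.2 data
toy.df <- data.frame(
  Group = rep(c("A", "B", "C"), each = 4),
  y     = c(10, 12, 11, 13,  15, 17, 16, 18,  11, 9, 10, 12)
)

# Fit the means-only model and keep the ANOVA table
toy.aov <- anova(lm(y ~ Group, data = toy.df))

# Pull the two sums of squares out of the table
ss_g <- toy.aov["Group", "Sum Sq"]
ss_r <- toy.aov["Residuals", "Sum Sq"]

n <- nrow(toy.df)                  # total number of observations
k <- length(unique(toy.df$Group))  # total number of groups

# The F-test statistic as per the formula
f0 <- (ss_g / (k - 1)) / (ss_r / (n - k))

# Agrees with the value reported by anova()
all.equal(f0, toy.aov["Group", "F value"])
```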

Briefly: F-distribution

When the test statistic is the ratio of two mean squares, \(f_0\), we use the F-distribution to calculate the p-value

The mathematical details relevant for us in DATAX121 are that:

  • The F-distribution is defined by two degrees of freedom parameters, \(\nu_1\) and \(\nu_2\)
  • It’s the exact model for the sampling distribution of \(f_0\) if the assumptions for inference on three or more \(\mu_i\)s have been met

Figure: The F-distribution when \(\nu_1 = 2\) and \(\nu_2 = 87\) superimposed on top of simulated F-test statistics from three populations with the same underlying distribution
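
A simulation in the spirit of the figure can be sketched as follows. This is not the code behind the figure: the seed, the number of simulations, and the Normal population with mean 60 and standard deviation 15 are all arbitrary assumptions chosen for illustration.

```r
set.seed(121)  # arbitrary seed, for reproducibility only

n_sims <- 2000
groups <- factor(rep(c("G1", "G2", "G3"), each = 30))  # k = 3, n = 90

# Every group is drawn from the same Normal population, so H0 is true
sim_f <- replicate(n_sims, {
  y <- rnorm(length(groups), mean = 60, sd = 15)
  anova(lm(y ~ groups))[1, "F value"]
})

# Theoretical 95th percentile of the F-distribution with nu1 = 2, nu2 = 87
f_crit <- qf(0.95, df1 = 2, df2 = 87)

# Proportion of simulated F-test statistics beyond that percentile;
# if the F-distribution is the right model, this should be close to 0.05
mean(sim_f > f_crit)
```

Calling `hist(sim_f, freq = FALSE)` followed by `curve(df(x, 2, 87), add = TRUE)` would overlay the theoretical density on the simulated statistics.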

Calculation of the p-value (for the F-test)

Let \(F\) be the F-distribution with \(\nu_1 = k - 1\) and \(\nu_2 = n - k\)

  • \(\nu_1\) is the F-distribution’s first degrees of freedom parameter
  • \(\nu_2\) is the F-distribution’s second degrees of freedom parameter
  • \(n\) is the total number of observations
  • \(k\) is the total number of groups

\(\quad p\text{-value} = \mathbb{P}(F > |f_0|)\)
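
In R, this upper-tail probability can be obtained with `pf()`. As a sketch, using the F-test statistic reported for CS 2.2 (\(f_0 = 4.6484\), with \(k = 3\) and \(n = 90\)):

```r
# Values taken from the CS 2.2 ANOVA table in this section
f0 <- 4.6484  # the F-test statistic
k  <- 3       # total number of groups
n  <- 90      # total number of observations

# Upper-tail probability of the F-distribution with
# nu1 = k - 1 and nu2 = n - k degrees of freedom
p_value <- pf(f0, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

# Compare with the Pr(>F) column of the ANOVA table
p_value
```

Setting `lower.tail = FALSE` asks for \(\mathbb{P}(F > f_0)\) directly, which avoids computing `1 - pf(...)` by hand.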

Interpretation of the F-test’s p-value

# Fit the means-only model to the data
lm(GillRate ~ Calcium, data = respiration.df) |>
  # Decompose the total "variability" between and within groups
  anova()
Analysis of Variance Table

Response: GillRate
          Df  Sum Sq Mean Sq F value  Pr(>F)  
Calcium    2  2037.2 1018.61  4.6484 0.01208 *
Residuals 87 19064.3  219.13                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Recall that the total "variability" is
(nrow(respiration.df) - 1) * var(respiration.df$GillRate)
[1] 21101.56